Abstract

We consider the problem of inferring the probability distribution associated with a language, given 
data consisting of an infinite sequence of elements of the language. We do this under two assumptions on
the algorithms concerned: (i) like a real-life algorithm it has round-off errors, and (ii) it has no round-off
errors. Assuming (i) we (a) consider a probability mass function of the elements of the language if
the data are drawn independent identically distributed (i.i.d.), provided the probability mass function is
computable and has a finite expectation. We give an effective procedure to almost surely identify in the
limit the target probability mass function using the Strong Law of Large Numbers. Second (b) we treat
the case of possibly incomputable probability mass functions in the above setting. In this case we can
only pointwise converge to the target probability mass function almost surely. Third (c) we consider
the case where the data are dependent assuming they are typical for at least one computable measure 
and the language is finite. There is an effective procedure to identify by infinite recurrence a nonempty 
subset of the computable measures according to which the data is typical. Here we use the theory of 
Kolmogorov complexity. Assuming (ii) we obtain the weaker result for (a) that the target distribution is 
identified by infinite recurrence almost surely; (b) stays the same as under assumption (i). We consider 
the associated predictions. 

I. Introduction 

In cognition and science one learns by observation. The perceptual system of an individual person, or 
the data-gathering resources of a scientific community, incrementally gathers empirical data, and attempts 
to find the structure in that data. The question arises: under what conditions is it possible precisely to 
infer the structure underlying those observations? Or, relatedly, under what conditions could a machine 
learning algorithm potentially precisely recover this structure? We can model this problem as having 
the following form: given a semi-infinite sequence of samples from a probability distribution, under
what conditions is it possible precisely to recover this distribution? Moreover, we can focus on the case 
where each observation is coded in a language — that is, each observation corresponds to an element 
of a countable set of sentences. Then the problem at hand is to recover the probability induced by the 
language. 

In the context of the cognitive processes of an individual, information from the senses is presumed to 
be coded in the brain to some finite precision (indeed, neural firing is discrete, lTl9l ). Linguistic input, in 
particular, can be coded in a hierarchy of discrete symbolic representations, as described by generative 
grammar (for example, lfT2l ). And, in the context of the operation of the scientific community, data is 
digitally coded to finite precision in symbolic codes. There are at least three reasons for scepticism that 
precisely recovering the probability of possible observations is possible. 

First, as observed by Popper ifTTll, however much data has been encountered, any theory or model can
be falsified by the very next piece of data. However many white swans are observed, there is always
the possibility that the very next swan will be black, or some more unlikely color. Similarly, in the case of
a child learning a language, however often the child encounters sentences following a particular set of
grammatical rules, it is always possible that the very next sentence encountered will violate these rules
(for example [ToTl). Thus, however much data has been encountered, there is no point at which the learner can
announce a particular probability as correct with any certainty. But this does not rule out the possibility
that the learner might learn to identify the correct probability in the limit. That is, perhaps the learner
might make a sequence of guesses, finally locking on to the correct probability and sticking to it forever,
even though the learner can never know for sure that it has identified the correct probability successfully.
We shall therefore consider identification in the limit below (following, for example, 0, (H, ifTolD . 

Second, in conventional statistics, probabilistic models are typically idealized as having continuous 
valued parameters; and hence there is an uncountable number of possible probabilities, from which the 
correct probability is to be recovered. In general it is impossible that a learner can make a sequence of 
guesses that precisely locks on to the correct values of continuous parameters. This is because the possible
strategies of learners are effective in the sense of Turing [20] and thus countable. This assumption is,
of course, obeyed by any practical machine learning algorithm and we assume also by the brain. The 
set of such strategies can express only a countable number of possible hypotheses. From this mild 
assumption, it is, of course, immediately evident that the overwhelming majority of a continuum of 
hypotheses cannot be represented, let alone learned. Moreover, there is a particularly natural restriction 
concerning the set of probabilities from which the learner chooses: they must be computable (these notions
are made precise below). This seems a reasonable restriction, both in cognitive science and scientific 
methodology. After all, the assumption that individual human information processing is computable is a 
founding assumption of cognitive science (see for example, EH); and the same constraint arguably applies 
to every practically usable scientific theory (although whether this holds is discussed in [2]). We shall 
see below that restricting ourselves to the computable simplifies the problem of precisely reconstructing 
the correct probability from the observed data. For example, the computability of the set of possible 
probabilities means that these can be enumerated; and it may then be possible to gradually home in on the
correct probability, by successively eliminating earlier ones in the enumerated list. As we see below, it 
is also possible to provide approximation results if the computability restriction is dropped. 

A third reason for initial scepticism also concerns computability — this time for the learner, not just 
the probability to be recovered. Even if there is, in principle, sufficient data to pin down the correct 
probability precisely, there remains the question of whether there is a feasible computational procedure 
that can reliably map from data to a sequence of guesses that eventually lock on to the correct probability. 
Real-life computational procedures are finite and always have round-off errors. We outline positive results 
that can be obtained for computable learners with or without such round-off errors. 

A. Preliminaries 

A language is a set of sentences. The learnability of a language under various computational 
assumptions is the subject of an immensely influential approach in @ and especially Q. But surely 
in the real world the chance of one sentence of a language being used is different from another one. For 
example, many short sentences have a larger chance of turning up than very long sentences. Thus, the 
elements of a given language are distributed in a certain way. There arises the problem of identifying or 
approximating this distribution. 

We first introduce some terminology. A function is computable, if there is a Turing machine (or 
any other equivalent computational device such as a universal programming language) that maps the 
arguments to the values. We say that we identify a function $f$ in the limit if we effectively produce an
infinite sequence $f_1, f_2, \ldots$ of functions and $f_i = f$ for all but finitely many $i$. This corresponds to the
notion of "identification in the limit" in [5]. We identify a function $f$ by infinite recurrence if $f_i = f$ for
infinitely many $i$. A sequence of functions converges to a function $f$ pointwise if $\lim_{i \to \infty} f_i(a) = f(a)$
for all $a$ in the domain of $f$. The functions we are interested in are versions of the probability mass
functions.
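
For reference, the three notions just introduced can be written with explicit quantifiers. This is only a restatement of the definitions above; the threshold $N$ and the tolerance $\varepsilon$ are auxiliary symbols introduced here.

```latex
\begin{align*}
\text{identification in the limit:} \quad
  & \exists N \;\forall i > N:\; f_i = f, \\
\text{identification by infinite recurrence:} \quad
  & \forall N \;\exists i > N:\; f_i = f, \\
\text{pointwise convergence:} \quad
  & \forall a \;\forall \varepsilon > 0 \;\exists N \;\forall i > N:\;
    |f_i(a) - f(a)| < \varepsilon .
\end{align*}
```

Identification in the limit implies both identification by infinite recurrence and pointwise convergence; neither converse holds in general.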

The restriction to computable probability mass functions (or more generally any restriction to a 
countable set of probability mass functions) is both cognitively realistic (if we assume language is 
generated by a computable process) and dramatically simplifies the problem of language identification. 
This is also the case for the use of algorithms with round-off errors. 

B. Related work 

In 12 (citing previous more restricted work) a target probability mass function was identified in the
limit when the data are drawn independent identically distributed (i.i.d.) in the following setting. Let the
target probability mass function $p$ be an element of a list $q_1, q_2, \ldots$ subject to the following conditions: (i)
every $q_i : \mathcal{N} \to \mathcal{R}^+$ is a probability mass function where $\mathcal{N}$ and $\mathcal{R}^+$ denote the positive natural numbers
and the positive real numbers, respectively; (ii) there is a total computable function $C(i, x, \epsilon) = r$ such
that $|q_i(x) - r| \leq \epsilon$ with $r, \epsilon > 0$ rational numbers. The technical means used are the Law of the
Iterated Logarithm and the Kolmogorov-Smirnov test. The algorithms used have no round-off errors.
However, the list $q_1, q_2, \ldots$ cannot contain all computable probability mass functions (Lemma 4.3.1 in
[14]).

C. Results 

In Section II we deal with probability mass functions. For technical reasons we introduce a weaker form
thereof called "semiprobability mass functions." Consider a probability mass function satisfying (A.1)
below associated with a language. The data consist of an infinite sequence of elements of this language that
are drawn i.i.d. The aim is to identify the probability mass function given the data. (In contrast to ID we
allow all computable probability mass functions.) In Section II-A we consider algorithms without round-off
errors. Then, we identify the target distribution by infinite recurrence almost surely. In Section II-B
the identification algorithm is subject to round-off errors and we identify the target distribution in the
limit almost surely (underpinning the result announced in Q). In Section III we treat the case of possibly
incomputable probability mass functions. Then we can only show pointwise convergence almost surely.
This result holds both with and without round-off errors. The technical tool in these sections is the Strong
Law of Large Numbers. In all these results the language concerned can be infinite. In Section IV
we consider the case where the data are dependent, assuming they are typical (Definition 6) for at least
one computable measure. In contrast to the i.i.d. case, it is possible that the data are typical for many
measures. The language concerned is finite and the identification algorithm has round-off errors. Then,
we identify by infinite recurrence a nonempty subset of the computable measures for which the data are
typical. The technical tool is Kolmogorov complexity. Finally, in Section V we consider the associated
predictions. We defer the proofs of the theorems to the Appendix.

II. Computable Probability Mass Functions 

Most known probability mass functions are computable provided their parameters are computable. In 
order that it is computable we only require that the probability mass function is finitely describable and 
there is an effective process producing it lHOl . 

It is known that the overwhelming majority of real numbers are not computable. An example of 
an incomputable probability mass function therefore is the one associated with a biased coin with an 
incomputable probability $p$ of outcome heads and probability $1 - p$ of outcome tails, $0 < p < 1$. On the
other hand, if $p$ is lower semicomputable, then we can effectively find nonnegative integers $a_1, a_2, \ldots$
and $b_1, b_2, \ldots$ such that $a_n/b_n \leq a_{n+1}/b_{n+1}$ and $\lim_{n \to \infty} a_n/b_n = p$. Let us generalize this observation.

Definition 1: If a function has as values pairs of nonnegative integers, such as $(a, b)$, then we can
interpret this value as the rational $a/b$. This leads to the notion of a computable function with rational
arguments and real values. A real function $f(x)$ with $x$ rational is semicomputable from below if it
is defined by a rational-valued total and computable function $\phi(x, k)$ with $x$ a rational number and
$k$ a nonnegative integer such that $\phi(x, k+1) \geq \phi(x, k)$ for every $k$ and $\lim_{k \to \infty} \phi(x, k) = f(x)$.
This means that $f$ can be computably approximated arbitrarily closely from below (see [14], p. 35). A
function $f$ is semicomputable from above if $-f$ is semicomputable from below. If a real function is both
semicomputable from below and semicomputable from above then it is computable. A function $f$ is a
semiprobability mass function if $\sum_x f(x) \leq 1$ and it is a probability mass function if $\sum_x f(x) = 1$. It
is customary to write $p(x)$ for $f(x)$ if the function involved is a semiprobability mass function.
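
As a small illustration of semicomputability from below, the sketch below approximates a single real number (a constant function in the sense of Definition 1) by a nondecreasing sequence of rationals. The choice of the number $\ln 2$ and the name `phi` are assumptions made for this example only; the series $\ln 2 = \sum_{j \geq 1} 1/(j\,2^j)$ is a standard identity.

```python
from fractions import Fraction


def phi(k: int) -> Fraction:
    """k-th rational approximation from below of ln 2.

    Uses the partial sums of ln 2 = sum_{j>=1} 1 / (j * 2**j).
    Every term is positive, so phi(k) <= phi(k + 1) for all k.
    """
    return sum((Fraction(1, j * 2**j) for j in range(1, k + 1)), Fraction(0))


# First few approximations: exact rationals that never decrease.
approximations = [phi(k) for k in range(6)]
```

Because every value is an exact `Fraction`, the monotonicity required by Definition 1 can be checked without any rounding.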

We cannot effectively enumerate all computable probability mass functions (this is a consequence
of Lemma 4.3.1 in [14]). However, it is possible to enumerate all and only the semiprobability mass
functions that are lower semicomputable. This is done by fixing an effective enumeration of all Turing
machines of the so-called prefix type. (Such an enumeration is much the same as effectively enumerating
all programs in a conventional computer programming language that is computationally universal and
where the programs are prefix-free; a set is prefix-free if no element is a proper prefix of any other. Most,
if not all, conventional computer programming languages satisfy these requirements.) It is possible to
change every Turing machine description in the enumeration into one that computes a semiprobability
mass function that is semicomputable from below, Theorem 4.3.1 in [14] (originally in 12T1 . ifLSTO). The result
is

$$Q = q_1, q_2, \ldots, \qquad \text{(II.1)}$$

a list containing all and only semiprobability mass functions that are semicomputable from below. Without
loss of generality every element of $Q$ is over the alphabet $L$.

Definition 2: There is a total and computable function $\phi(i, x, t) = q_i^t(x)$ such that $\phi(i, x, t) \leq
\phi(i, x, t+1)$ and $\lim_{t \to \infty} q_i^t(x) = q_i(x)$.

Every probability mass function is a semiprobability mass function, and every computable probability 
mass function is semicomputable from below. Therefore, every computable probability mass function is in 
list Q. Indeed, every such function will be in the list infinitely often, which follows simply from the fact 
that there are infinitely many computer programs that compute a given function. If a lower semicomputable 
semiprobability mass function is a probability mass function, then it must be computable ([14], Example
4.3.2). Therefore, every probability mass function in the list is computable. It is important to realize that,
although the description of every computable probability mass function is in list Q, it may be there in 
lower semicomputable format and we may not know it is computable. 

A. Algorithms Without Round-Off Errors 

Definition 3: Let $x = x_1, x_2, \ldots$ be an infinite sequence of elements of the language $L$ drawn i.i.d.
according to a computable probability mass function $p$. Let $Q(p)$ be defined as the set of indices of
elements of $Q$ that are copies of $p$.

Theorem 1: Computable I.I.D. Probability Identification (No Round-Off) Let $L$ be a
language $\{a_1, a_2, \ldots\}$ (a finite or countably infinite set) with a computable probability mass function
$p$. Without loss of generality we assume that every $a \in L$ is a finite integer. Let the mean of $p$ exist
($\sum_{a \in L} a\,p(a) < \infty$). The algorithm in the proof takes as input an infinite sequence $x = x_1, x_2, \ldots$ of
elements of $L$ drawn i.i.d. according to $p$. After processing $x_n$ the algorithm computes as output the
index $i_n$ of a semiprobability mass function in the enumeration $Q$. Define $Q_\infty$ as the set of indices of
elements in $Q$ that appear infinitely often in the sequence produced by the algorithm. Then,

(i) $Q_\infty \neq \emptyset$ almost surely;

(ii) if $i \in Q_\infty$ then $i \in Q(p)$ almost surely; and

(iii) almost surely $\liminf_{n \to \infty} i_n = \min Q(p)$.

B. Algorithms With Round-Off Errors 

We will want to computationally separate probability mass functions $p$ for which $\sum_x p(x) = 1$ from
semiprobability mass functions $q$ for which $\sum_x q(x) < 1$ in the list $Q$.

Definition 4: To deal with truncation errors in the above (in)equalities we use the following notation.
The truncation error is a known additive term $\pm\epsilon$, $\epsilon > 0$, and we denote (in)equalities up to this truncation
error by $\stackrel{+}{<}$, $\stackrel{+}{>}$, $\stackrel{+}{=}$, $\stackrel{+}{\leq}$, and $\stackrel{+}{\geq}$. Every function that satisfies the probability mass function equality within
the round-off error is viewed as a probability mass function.
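
The following sketch shows one possible reading of comparisons "up to the truncation error". The section does not spell out the exact semantics of the relations, so the definitions below (equality meaning a difference of at most `eps`, and the tolerance value itself) are assumptions made purely for illustration; all names are hypothetical.

```python
from fractions import Fraction

EPS = Fraction(1, 1000)  # hypothetical known truncation error


def eq_plus(a, b, eps=EPS):
    """a equals b up to the additive truncation error (assumed reading)."""
    return abs(a - b) <= eps


def le_plus(a, b, eps=EPS):
    """a is at most b up to the additive truncation error (assumed reading)."""
    return a <= b + eps


def is_pmf_up_to_truncation(values, eps=EPS):
    """True if the finitely many given values sum to 1 within eps."""
    return eq_plus(sum(values, Fraction(0)), Fraction(1), eps)
```

Under this reading a function whose values sum to 0.9999 is viewed as a probability mass function when the tolerance is 0.001, which matches the last sentence of Definition 4.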

Let $L$ be a language $\{a_1, a_2, \ldots\}$, $\#a(x_1, x_2, \ldots, x_n)$ be the number of elements in $x_1, x_2, \ldots, x_n$ equal
to $a \in L$, and $k'$ be the least index of an element in the list $Q$ such that



Theorem 2: Computable I.I.D. Probability Identification (Round-Off) Let $L$ be a language
$\{a_1, a_2, \ldots\}$ (a finite or countably infinite set) with a computable probability mass function $p$.
Without loss of generality we assume that every $a \in L$ is a finite integer. Let the mean of $p$ exist
($\sum_{a \in L} a\,p(a) < \infty$). The algorithm in the proof takes as input an infinite sequence $x = x_1, x_2, \ldots$ of
elements of $L$ drawn i.i.d. according to $p$. After processing $x_n$ the algorithm computes as output the
index $i_n$ of a semiprobability mass function in the enumeration $Q$. There exists an $N$ such that $i_n = k'$
for all $n > N$. We have $k' \leq k$ with $k$ as in Theorem 1.
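
To give a feeling for "output the least index that fits the data", here is a toy sketch. It is not the algorithm of the proof: it assumes a finite list of fully known candidate functions rather than the infinite list of lower semicomputable functions, and a fixed tolerance `eps` standing in for the truncation error. The function name, the candidates and the data are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def least_consistent_index(candidates, data, eps):
    """Least index i whose candidate matches the relative frequencies in
    `data` within eps on every symbol that is observed or has mass under
    the candidate. Returns None if no candidate fits."""
    n = len(data)
    if n == 0:
        raise ValueError("need at least one observation")
    counts = Counter(data)
    for i, q in enumerate(candidates):
        symbols = set(q) | set(counts)
        if all(abs(Fraction(counts[a], n) - q.get(a, Fraction(0))) <= eps
               for a in symbols):
            return i
    return None


# Hypothetical finite list; index 2 duplicates index 1, just as a function
# may occur several times in an enumeration.
half = Fraction(1, 2)
candidates = [
    {0: half, 1: half},
    {0: Fraction(1, 4), 1: Fraction(3, 4)},
    {0: Fraction(1, 4), 1: Fraction(3, 4)},
]
data = [1, 1, 0, 1] * 25  # relative frequencies 1/4 and 3/4
guess = least_consistent_index(candidates, data, Fraction(1, 20))
```

With these inputs the first candidate is rejected and the earlier of the two identical candidates is returned, mirroring the role of the least index in the theorems.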



III. General Probability Mass Functions

Can we get rid of the restriction that the probability mass function be computable? Above we used the
computability to consider a well-ordered countable list of lower semicomputable semiprobability mass
functions, and pin-pointed the least occurrence of the target in the list. We use the fact that we have a
guarantee that the target is in the list. If we have no such guarantee (possibly there is no list), we can
still converge pointwise to the empirical probability mass function of the language $L$ based on the data
$x_1, x_2, \ldots$. We do this by an algorithm computing the probability $p(a)$ in the limit for all $a \in L$.

Note that this is quite different from Theorems 1 and 2. There we indicated (the least occurrence of) $p$
precisely in a well-ordered list even though the result only holds "almost surely." In contrast, here we find
$p(a)$ for all $a \in L$ in the limit only. Moreover, this holds also "almost surely." However, the probability
mass functions considered here consist of all probability mass functions on $L$, computable or not.

Theorem 3: I.I.D. Probability Approximation With An Algorithm Without Round-Off
Error Let $L$ be a language $\{a_1, a_2, \ldots\}$ (a finite or countably infinite set) with a probability
mass function $p$. Without loss of generality we assume that every $a \in L$ is a finite integer. Let the mean
of $p$ exist ($\sum_{a \in L} a\,p(a) < \infty$). The algorithm in the proof takes as input an infinite sequence $x_1, x_2, \ldots$
of elements of $L$ drawn i.i.d. according to $p$. After processing $x_n$ the algorithm computes as output a
probability mass function $p_n$. Almost surely, $\lim_{n \to \infty} p_n(a) = p(a)$ for all $a \in L$.
With round-off errors the theorem is essentially the same.
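
A natural candidate for $p_n$ is the relative frequency of each element in the first $n$ observations. The proof is not part of this section, so treating $p_n$ as the relative frequency is an assumption of this sketch, and the function name is hypothetical.

```python
from collections import Counter
from fractions import Fraction


def empirical_pmf(prefix):
    """Return p_n with p_n(a) = #a(x_1, ..., x_n) / n for each observed a.

    Symbols that do not occur in the prefix implicitly get probability 0.
    """
    n = len(prefix)
    return {a: Fraction(c, n) for a, c in Counter(prefix).items()}


p_6 = empirical_pmf([1, 2, 1, 3, 1, 2])
```

For the six observations above the estimate assigns 1/2, 1/3 and 1/6 to the elements 1, 2 and 3, and the values always sum to exactly one.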

Remark 1: Can this result be strengthened to a form of dependent variables? It all depends on 
whether the Strong Law of Large Numbers holds. We know from P- 474, that the Strong Law of 
Large Numbers holds for stationary ergodic sources with finite expected value. $\diamond$



IV. Computable Measures

This time let the language $L = \{a_1, a_2, \ldots, a_m\}$ be a finite set, and $x_1, x_2, \ldots$ the data consisting of an
infinite sequence of elements from $L$. We drop the requirement of independence of the different elements
in the data. Thus, we assume that our data sequence $x_1, x_2, \ldots$ is possibly dependent. This implies that
the probability model for $L$ is more general than i.i.d. In fact, we will allow all computable measures on
the infinite sequences of elements from $L$. Thus, the probability model used includes stationary processes,
ergodic processes, Markov processes of any order, and other models, provided they are computable.

Given a finite sequence $x = x_1, x_2, \ldots, x_n$ of elements of the basic set $L$, we consider the set of infinite
sequences starting with $x$. The set of all such sequences is written as $\Gamma_x$, the cylinder of $x$. We associate
a probability $\mu(\Gamma_x)$ with the event that an element of $\Gamma_x$ occurs. Here we abbreviate $\mu(\Gamma_x)$ to $\mu(x)$.
The transitive closure of the intersection, union, complement, and countable union of cylinders gives a
set of subsets of $L^\infty$. The probabilities associated with these subsets are derived from the probabilities
of the cylinders in standard ways Q. A semimeasure $\mu$ satisfies the following (here $\epsilon$ denotes the empty
sequence, not the truncation error):

$$\mu(\epsilon) \leq 1, \qquad \mu(x) \geq \sum_{a \in L} \mu(xa), \qquad \text{(IV.1)}$$

and if equality holds instead of each inequality we call $\mu$ a measure. Using the above notation, a
semimeasure $\mu$ is lower semicomputable if it is defined by a rational-valued computable function
$\phi(x, k)$ with $x \in L^*$ and $k$ a nonnegative integer such that $\phi(x, k+1) \geq \phi(x, k)$ for every $k$ and
$\lim_{k \to \infty} \phi(x, k) = \mu(x)$. This means that $\mu$ can be computably approximated arbitrarily closely from below
for each argument $x \in L^*$.

To separate measures from semimeasures that are not measures using an algorithm subject to round-off
errors, we have to deal with truncation errors in the (semi)measure (in)equalities. This truncation error
is a known additive term $\pm\epsilon$, $\epsilon > 0$. Again we use Definition 4: (in)equalities up to this truncation error
are denoted by $\stackrel{+}{<}$, $\stackrel{+}{>}$, $\stackrel{+}{=}$, $\stackrel{+}{\leq}$, and $\stackrel{+}{\geq}$.

In the argument below we want to effectively enumerate all computable measures. By Lemma 4.5.1
of [14], if a lower semicomputable semimeasure is a measure, then it is computable. Thus it suffices
to effectively enumerate all lower semicomputable measures. By Lemma 4.5.2 of the cited reference
this is not possible. But if we effectively enumerate all lower semicomputable semimeasures, then this
enumeration includes all lower semicomputable measures which are a fortiori computable measures.
This turns out to be possible (originally in ll2~Tll . H3). Just as in the case of lower semicomputable
semiprobabilities, but with a little more effort, we can effectively enumerate all and only lower
semicomputable semimeasures, as is described in the proof of Theorem 4.5.1 of [14]. This goes by taking
a standard enumeration of all Turing machines $T_1, T_2, \ldots$ of the so-called monotone type. Subsequently
we transform every Turing machine in the list to one that lower semicomputes a semimeasure. Also, it
is shown that all lower semicomputable semimeasures are in the list. The result is

$$\mathcal{M} = \mu_1, \mu_2, \ldots . \qquad \text{(IV.2)}$$

This list contains all and only semimeasures that are semicomputable from below. It is important to
realize that, although the description of every computable measure is in list $\mathcal{M}$, it may be there in lower
semicomputable format and we may not know it is computable.

Definition 5: There is a total and computable function $\phi(i, x, t) = \mu_i^t(x)$ such that $\phi(i, x, t) \leq
\phi(i, x, t+1)$ and $\lim_{t \to \infty} \mu_i^t(x) = \mu_i(x)$.

Remark 2: Let our data be $x_1, x_2, \ldots$. Possibly there is no computable measure, or more than one,
that has these data as a "typical" sequence. Here we use "typicality" according to Definition 6
below. Such typical infinite sequences are also called "random" with respect to the measure involved.
For instance, let the data sequence be $0, 0, \ldots$. Then this is a typical sequence of the computable measure
that gives measure 1 to every initial segment of this data sequence. But if we consider $\mu_{1/2}$ that gives
measure $\frac{1}{2}$ to every initial segment of $0, 0, \ldots$ and also measure $\frac{1}{2}$ to every initial segment of $1, 1, \ldots$,
then the considered data sequence is also a typical sequence of $\mu_{1/2}$. Similarly, it is a typical sequence
of the measure $\mu_{1/k}$ that gives measure $1/k$ to every initial segment of $0, 0, \ldots$ through $k-1, k-1, \ldots$
($k < \infty$).

Hence our task is not to identify the measure according to which the data sequence was generated as
a typical sequence, but to identify measures which could have generated the data sequence as a typical
sequence. Note that we have reduced our task from identifying the computable measure generating the
data, which is not possible, to identifying some computable measures that could have generated the data
as a typical sequence. $\diamond$

Remark 3: We assume here that $L$ is finite. This is no genuine restriction since all real natural or
artificial languages that ever existed contain fewer than, say, $10^{100}$ elements. This is supported by the fact
that $10^{100}$ far exceeds the number of atoms currently believed to exist in the observable universe. We
assume that $L$ is finite since it makes the computation of $\sum_{a \in L} \mu(xa)$ for $x \in L^*$ effective.

Where appropriate, we shall use the (in)equalities according to Definition 4. This has bad and good
effects. The bad effect is that semimeasures that violate the measure equalities by at most the truncation
error are counted as measures. The good effects are, besides computationally verifiable (in)equalities, that
in the infinite processes to construct lower semicomputable semimeasures (the proof of Theorem 4.5.1
of [14]) we only need to consider finite initial segments. $\diamond$
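
The remark that a finite $L$ makes the sum over one-symbol extensions effective can be illustrated as follows. The sketch assumes the standard semimeasure condition from the literature, namely that $\mu(x)$ is at least the sum of $\mu(xa)$ over all $a \in L$, and checks it on all strings up to a given length, optionally up to a tolerance. The names `violations` and `uniform` and the example functions are hypothetical.

```python
from fractions import Fraction
from itertools import product


def violations(mu, alphabet, max_len, eps=Fraction(0)):
    """Strings x with len(x) <= max_len for which the condition
    mu(x) >= sum over a of mu(x + a) fails by more than eps."""
    bad = []
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            x = "".join(letters)
            # Finite alphabet: this sum has finitely many terms.
            total = sum((mu(x + a) for a in alphabet), Fraction(0))
            if total > mu(x) + eps:
                bad.append(x)
    return bad


def uniform(x, k=2):
    """Uniform measure on sequences over a k-letter alphabet."""
    return Fraction(1, k ** len(x))
```

For example, `violations(uniform, "01", 3)` returns an empty list, while a function that assigns 1 to every string violates the condition at every string.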
Let $L$ be a language with a computable measure $\mu$ on the infinite sequences of its elements. We recall
the notion that an infinite sequence $x$ is "typical" for $\mu$.

Definition 6: Let $x = x_1, x_2, \ldots$ be an infinite sequence of elements of the language $L$. The infinite
sequence $x$ is typical or random for a computable measure $\mu$ if it passes all effective sequential tests for
randomness with respect to $\mu$ in the sense of Martin-Löf lfl31l . The set of such sequences has $\mu$-measure
one. We define $M(x)$ as the set of indices of elements of the list $\mathcal{M}$ that are computable measures $\mu$
such that $x$ is typical for $\mu$.

Theorem 4: Computable Measure Identification By Algorithms With Round-Off Errors
Let $L = \{a_1, a_2, \ldots, a_m\}$, $m < \infty$, be a language and let $x_1, x_2, \ldots$ with $x_i \in L$ be an infinite data
sequence. Assume that the data sequence is typical for at least one computable measure. The algorithm
in the proof takes as input the infinite sequence $x_1, x_2, \ldots$. After processing $x_n$ the algorithm computes
as output the index $i_n$ of an element of $\mathcal{M}$. Define $M_\infty$ as the set of indices of elements in $\mathcal{M}$ that
appear infinitely often in the sequence produced by the algorithm. Then, $M_\infty \subseteq M(x)$, $M_\infty \neq \emptyset$, and
$|M_\infty| < \infty$.

V. Prediction

In Sections II and III the data are drawn i.i.d. according to a probability mass function $p$ on the
elements of $L$. Given $p$, we can predict the probability $p(a \mid x_1, \ldots, x_n)$ that the next draw results in an
element $a$ when the previous draws resulted in $x_1, \ldots, x_n$. (The same holds in appropriate form for a
good pointwise approximation of $p$.)

For measures as in Section IV, allowing dependent data, the situation is quite different. In the first
place there can be many measures that have $x = x_1, x_2, \ldots$ as typical (random) data. In the second place,
different measures among these may give different probability predictions using the same initial segment of
$x$.

Let us give a simple example. Suppose the data is $x = a, a, \ldots$. This data is typical for the measure $\mu_1$
defined by $\mu_1(x) = 1$ for every $x$ consisting of a finite or infinite string of $a$'s and $\mu_1(x) = 0$ otherwise.
But the data is also typical for $\mu_2$ which gives probability $\mu_2(x) = \frac{1}{2}$ for every string consisting of an $a$
followed by a finite or infinite string of $a$'s, or $a$ followed by a finite or infinite string of $b$'s.

Firstly, $\mu_1$ is not equal to $\mu_2$, even though $x$ is typical for both of them. Secondly, $\mu_1(a|a) = 1$. But
$\mu_2(a|a) = \mu_2(b|a) = \frac{1}{2}$. In fact, $\mu_1(y|a) = 1$ for every $y$ consisting of a finite or infinite string of $a$'s,
and 0 otherwise. The conditional probability $\mu_2(y|a)$ is $\frac{1}{2}$ for $y$ consisting of a finite or infinite string of
$a$'s or $y$ consisting of a finite or infinite string of $b$'s, and 0 otherwise. Thus, different measures for which
the data is typical may give very different predictions. With respect to predictions we can only proceed
as follows: (i) find one or more measures for which the data is typical, and (ii) predict according to one
of these measures that we select.
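
The two measures of the example can be written down for finite strings and their predictions compared directly. The sketch makes one interpretive assumption: the second measure is read as putting mass 1/2 on the sequence of all $a$'s and mass 1/2 on an $a$ followed by only $b$'s, so that the single-symbol string $a$ itself has measure 1. The function names are hypothetical, and only finite prefixes are handled.

```python
from fractions import Fraction


def mu1(x: str) -> Fraction:
    """Measure concentrated on the sequence a, a, a, ..."""
    return Fraction(1) if set(x) <= {"a"} else Fraction(0)


def mu2(x: str) -> Fraction:
    """Mass 1/2 on a, a, a, ... and mass 1/2 on a, b, b, ... (assumed reading)."""
    if x in ("", "a"):
        return Fraction(1)
    if x[0] == "a" and set(x[1:]) in ({"a"}, {"b"}):
        return Fraction(1, 2)
    return Fraction(0)


def conditional(mu, y: str, x: str) -> Fraction:
    """mu(y | x) = mu(xy) / mu(x); only defined when mu(x) > 0."""
    return mu(x + y) / mu(x)


pred1 = conditional(mu1, "a", "a")  # prediction of mu1 after seeing a
pred2 = conditional(mu2, "a", "a")  # prediction of mu2 after seeing a
```

Both measures have the data sequence of all $a$'s as typical, yet after the same first observation they assign probabilities 1 and 1/2 to the next symbol being $a$.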

It does not seem to make sense to make a weighted prediction according to the measures for which the
data is typical. There may not be a single measure among them making that prediction. Moreover, the 
consecutive data resulting from many predictions may not be typical with respect to any of the original 
measures. 

The question arises how the i.i.d. case and the measure case relate to one another. The answer is as 
follows. For ease of writing we will ignore the adjective "computable." It is clear that the i.i.d. probabilities 
are a subset of the more general case of measures. Containment is proper since there is a measure that 
is not an infinite sequence of i.i.d. draws according to a probability mass function. An example is given 
below. 

Moreover, an infinite sequence of data can be typical for more than one measure, even though if such
a sequence is typical for an infinite sequence of i.i.d. draws of any probability mass function, then it
is typical for an infinite sequence of i.i.d. draws of only this single probability mass function. Thus, if
the data is typical for different measures, then at most one of the measures involved corresponds to i.i.d.
draws from a probability mass function.

Let us give an example. The measure $\mu_1$ above has a single typical sequence $x = a, a, \ldots$. This measure
results from infinitely many i.i.d. draws from $L$ according to the probability mass function $p(a) = 1$ and
$p(z) = 0$ for $z \in L \setminus \{a\}$. But $x$ is also typical for $\mu_2$, a measure which has no i.i.d. case that corresponds
to it. This can be seen as follows. The measure $\mu_2$ also has the typical sequence $y = a, b, b, \ldots$, a sequence
such that no infinite number of i.i.d. draws of any probability mass function corresponds to it. Namely, the
probability of $a$ and $b$ must both be non-zero to yield the sequence $y$. Hence a typical infinite sequence
must contain (in the i.i.d. case) infinitely many $a$'s and $b$'s. But $y$ does not do so.

Thus, the i.i.d. case is a proper subset of the measure case. A single infinite sequence can be typical 
for many (infinitely many) measures, even though if it is typical for an i.i.d. case it is typical for only a 
single one of the probability mass functions. But there are infinite sequences that are typical for many 
measures but not typical for any case of infinitely many i.i.d. draws according to a probability mass 
function. 

Appendix 

Proof of Theorem 1: Our data is, by assumption, generated by i.i.d. draws according to a
computable probability mass function $p$ satisfying (A.1). Formally, the data $x_1, x_2, \ldots$ is generated by a
sequence of random variables $X_1, X_2, \ldots$, each of which is a copy of a single random variable $X$ with
probability mass function $P(X = a) = p(a)$ for every $a \in L$. Without loss of generality $p(a) > 0$ for
all $a \in L$. The mean of $X$ exists by (A.1).

Remark 4: In probability theory the statement almost surely means "with probability one." Let us illustrate this notion. It is possible that a fair coin generates an infinite sequence $0, 0, \ldots$ even though the probability of 1 at each trial is $1/2$. The uniform measure of the set of infinite sequences in which the relative frequency of 1's goes to the limit $1/2$ is one. Call sequences in that set pseudo typical. The probability that an infinite sequence is pseudo typical is one, even though there are infinite sequences (like the one in the example above) that are not pseudo typical. Thus, "almost surely" may not mean "with certainty." ◊

The Strong Law of Large Numbers (originally in iPTOll ) states that if we perform the same experiment 
a large number of times, then almost surely the average of the results goes to the expected value. That 
is, if a mild condition is satisfied. We require that our sequence of random variables $X_1, X_2, \ldots$ satisfies Kolmogorov's criterion that

\[ \sum_{i} \frac{\sigma_i^2}{i^2} < \infty, \]

where $\sigma_i^2$ is the variance of $X_i$ in the sequence of mutually independent random variables $X_1, X_2, \ldots$. Since all $X_i$'s are copies of a single $X$, all $X_i$'s have a common distribution $p$. In this case we use the theorem on top of page 260 in [6]. To apply the Strong Law in this case it suffices that the mean of $X$ exists. (We denote this mean by $\mu$, not to be confused with the notation of measures $\mu(\cdot)$ we use below.) That is, we require that

\[ \mu = \sum_{a \in L} a\, p(a) < \infty. \tag{A.1} \]

Then, the Strong Law states that

\[ \Pr\left( \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} X_i = \mu \right) = 1, \]

or $(1/n) \sum_{i=1}^{n} X_i$ converges almost surely to $\mu$ as $n \to \infty$.
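As a numerical illustration of the Strong Law (a sketch only, not part of the proof), the following Python fragment draws i.i.d. samples from a small probability mass function and compares the running average with the mean of (A.1). The example pmf, the fixed seed, and the helper name `running_mean` are hypothetical choices made for this illustration.

```python
import random

def running_mean(pmf, n, seed=0):
    """Average of n i.i.d. draws from pmf (dict: integer outcome -> probability)."""
    rng = random.Random(seed)            # fixed seed: an assumption, for repeatability
    outcomes = list(pmf)
    weights = [pmf[a] for a in outcomes]
    draws = rng.choices(outcomes, weights=weights, k=n)
    return sum(draws) / n

pmf = {1: 0.5, 2: 0.3, 5: 0.2}           # hypothetical example pmf
mu = sum(a * q for a, q in pmf.items())  # the mean of (A.1)
for n in (10, 1000, 100000):
    # The sample average should approach mu as n grows.
    print(n, running_mean(pmf, n), mu)
```

A single run shows one sample path only; the theorem says that the set of paths on which the average fails to converge has probability zero.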

To determine the probability of an $a \in L$ we consider the related random variable $X_a$ with just two outcomes $\{a, \bar{a}\}$. This $X_a$ is a Bernoulli process $(q, 1 - q)$ where $q = p(a)$ is the probability of $a$ and $1 - q = \sum_{b \in L \setminus \{a\}} p(b)$ is the probability of $\bar{a}$. If we set $\bar{a} = \min (L \setminus \{a\})$, then the mean of $X_a$ is

\[ \mu_a = a q + \bar{a} (1 - q) \le \mu. \]
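The bound on the mean of the two-outcome variable can be checked term by term. The short derivation below writes $\bar{a}$ for the second outcome and uses only that $\bar{a} = \min(L \setminus \{a\})$ is at most every element of $L \setminus \{a\}$; the non-strict inequality is an assumption of this sketch, since equality occurs when $L$ has two elements.

```latex
\begin{align*}
\mu_a &= a\,p(a) + \bar{a} \sum_{b \in L \setminus \{a\}} p(b) \\
      &\le a\,p(a) + \sum_{b \in L \setminus \{a\}} b\,p(b) \\
      &= \sum_{b \in L} b\,p(b) = \mu .
\end{align*}
```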



Remark 5: Recall that $L$ may be infinite, that is, $L = \{a_1, a_2, \ldots\}$. Then a priori it could be that $\lim_{j \to \infty} \mu_{a_j} = \infty$. But we have just proved that not only does this not happen but even $\mu_{a_j} \le \mu$ for every $j$. ◊

Thus, every $a \in L$ incurs a random variable $X_a$ for which the equivalent of (A.1) applies. Therefore, according to the cited theorem the quantity $(1/n) \sum_{i=1}^{n} (X_a)_i$ converges almost surely to $\mu_a$ as $n \to \infty$. Therefore, almost surely

\[ \lim_{n \to \infty} \sum_{a_j \in L} \left| p(a_j) - \frac{\#a_j(x_1, x_2, \ldots, x_n)}{n} \right| = 0. \tag{A.2} \]
Algorithm ($x_1, x_2, \ldots$):
Step 1 for $n = 1, 2, \ldots$ execute Steps 2 through 4.
Step 2 Set $I := \emptyset$; for $i := 1, 2, \ldots, n$ execute Step 3.
Step 3 set $\gamma_{i,n} := 1$; L: if $\sum_{j=1}^{n} |q_i^n(a_j) - \#a_j(x_1, x_2, \ldots, x_n)/n| < 1/\gamma_{i,n}$
       then ($\gamma_{i,n} := \gamma_{i,n} + 1$; goto L) else $\gamma_{i,n} := \gamma_{i,n} - 1$;
       if $\gamma_{i,n} > \gamma_{i,n-1}, \ldots, \gamma_{i,1}$ then $I := I \cup \{i\}$.
Step 4 if $I \neq \emptyset$ then $i_n := \min I$ else $i_n := i_{n-1}$; output $i_n$.

Fig. 1. Algorithm 1a
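The control flow of Algorithm 1a can be sketched in Python. This is a simplified illustration under stated assumptions, not the procedure of the proof: the candidate list is finite, each candidate is given as an exact pmf rather than as lower approximations, the distance is summed over a finite alphabet, and the counter is capped at $n$ so that the inner loop terminates when the distance is exactly 0. The function name `algorithm_1a` and the toy data are hypothetical.

```python
from collections import Counter

def algorithm_1a(data, candidates, alphabet):
    """Sketch of Algorithm 1a; returns the outputs i_1, ..., i_N (1-based indexes)."""
    counts = Counter()
    gammas = [[] for _ in candidates]    # gammas[i]: earlier gamma values of candidate i+1
    outputs, previous = [], None
    for n, x in enumerate(data, start=1):
        counts[x] += 1
        selected = set()                 # the set I of Step 2
        for i in range(min(n, len(candidates))):
            q = candidates[i]
            # Distance between the candidate pmf and the empirical frequencies (Step 3).
            d = sum(abs(q.get(a, 0.0) - counts[a] / n) for a in alphabet)
            gamma = 1
            # Cap at n (an assumption of this sketch) so the loop ends when d == 0.
            while d < 1 / gamma and gamma <= n:
                gamma += 1
            gamma -= 1
            if all(gamma > g for g in gammas[i]):   # strictly larger than all earlier values
                selected.add(i + 1)
            gammas[i].append(gamma)
        if selected:
            previous = min(selected)     # Step 4
        outputs.append(previous)
    return outputs

data = ["a", "b"] * 25                   # deterministic toy data with frequencies 1/2, 1/2
candidates = [{"a": 0.9, "b": 0.1}, {"a": 0.5, "b": 0.5}]
out = algorithm_1a(data, candidates, ["a", "b"])
print(out[-5:])
```

On this toy input the second candidate keeps setting new records for its counter, while the counter of the first candidate stays bounded, so the output settles on index 2.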



With $q \neq p$ substituted for $p$ in the left-hand side of (A.2), we have that this left-hand side is almost surely unequal to 0. Using some probability theory, we can rewrite the Strong Law using only the finite initial segment of the infinite sequence of (copies of) the two-outcome random variables. We use ([6], p. 258 ff). For every pair $\epsilon > 0$ and $\delta > 0$, there is an $N$ such that for every $r > 0$, with probability at least $1 - \delta$, all $r + 1$ inequalities

\[ \left| \frac{\#a(x_1, x_2, \ldots, x_n)}{n} - p(a) \right| < \epsilon, \tag{A.3} \]

with $n = N, N + 1, \ldots, N + r$ are satisfied. That is, we can say, informally, that with overwhelming probability the left-hand side of (A.3) remains small for all $n \ge N$. Since we deal with all infinite outcomes of i.i.d. draws from the set $L$ according to $p$, for some sequences that are not pseudo typical (Remark 4) the inequality (A.3) does not hold. For example, always drawing $a_1$ while $p(a_1) = 1/2$. Therefore, the Strong Law holds "almost surely" and cannot hold "with certainty."

Definition 7: Let $k$ be the least index of an element of $Q$ such that $q_k = p$. For every $i$ with $1 \le i < k$, $\max_{a \in L} |q_k(a) - q_i(a)| > 0$ (this follows from the minimality of $k$). Define

\[ \alpha = \min_{1 \le i < k} \max_{a \in L} |q_k(a) - q_i(a)|. \]

Then, $\alpha > 0$. Let $a^i$ be the $a$ that reaches the maximum in $\max_{a \in L} |q_k(a) - q_i(a)|$ for $1 \le i < k$, and $\beta = \max_{1 \le i < k} \{j : a_j \in L \ \&\ a_j = a^i\}$. For every $i$ with $1 \le i < k$, let $t_i$ be least such that $q_i(a_j) - q_i^t(a_j) \le \alpha/2$ for every $t \ge t_i$ and $j \le \beta$. Define $\tau$ by $\tau = \max_{1 \le i < k} t_i$.

The sequence of outputs of the algorithm is $i_1, i_2, \ldots$ such that possibly $i_j < i_{j+1}$, $i_j > i_{j+1}$, or $i_j = i_{j+1}$. Recall that $Q_\infty = \{i : i_n = i \text{ for infinitely many } n\}$.

Claim 1: (i) $Q_\infty \neq \emptyset$ almost surely; (ii) if $i \in Q_\infty$ then $i \in Q(p)$ almost surely; (iii) almost surely $\liminf_{n \to \infty} i_n = \min Q(p)$.

Proof: The algorithm outputs a sequence $i_1, i_2, \ldots$. For $n_0 > 3 \max\{k, \beta, \tau\} \ge k + \beta + \tau$ the algorithm has considered somewhere along $n = 1, \ldots, n_0$ the approximations $q_1^t(a^1), \ldots, q_{k-1}^t(a^{k-1})$ in Step 3. By Definition 7 these lower approximations are within $\alpha/2$ of the final values of the $q_i(a^i)$ ($1 \le i < k$). Moreover, again by Definition 7 these final values differ at least by $\alpha$ from the values of $p = q_k$ for all arguments $a^1, \ldots, a^{k-1}$, respectively. Hence the approximations $q_1^t(a^1), \ldots, q_{k-1}^t(a^{k-1})$ differ at least by $\alpha/2$ from $q_k(a^1), \ldots, q_k(a^{k-1})$, respectively, for every $t \ge \tau$. However, Algorithm 1 (Figure 1) does not know $k$. We have to show how the algorithm handles this information.
Since

\[ \lim_{n \to \infty} \left| p(a) - \frac{\#a(x_1, x_2, \ldots, x_n)}{n} \right| = 0 \]

almost surely for every $a \in L$ by (A.2), it follows from the above that

\[ \lim_{n \to \infty} \left| q_i^n(a^i) - \frac{\#a^i(x_1, x_2, \ldots, x_n)}{n} \right| \ge \frac{\alpha}{2}, \]

almost surely ($1 \le i < k$). This means that for large enough $n$ we have almost surely that $1/\gamma_{i,n} > \alpha/3$ (with $\alpha > 0$ a constant) in Step 3 of the algorithm ($1 \le i < k$). Then, $\gamma_{i,n} < 3/\alpha$ and there is an $n_i$ such that for all $n > n_i$ we have $\gamma_{i,n} \le \max\{\gamma_{i,n-1}, \ldots, \gamma_{i,1}\}$ ($1 \le i < k$). Thus, for large enough $n$ and $1 \le i < k$ almost surely we have $i \notin I$ in Step 3.

Almost surely we have $\gamma_{k,n} > \gamma_{k,n-1}, \ldots, \gamma_{k,1}$ for infinitely many $n$ in Step 3 since $\lim_{n \to \infty} 1/\gamma_{k,n} = 0$ almost surely by (A.2). Namely, $q_k$ is the probability mass function according to which $x_1, x_2, \ldots$ is i.i.d. drawn from $L$. This means that almost surely $k$ is put in $I$ for infinitely many $n$ in Step 3. Moreover, almost surely $i \notin I$ for every $n > n_i$ and $1 \le i < k$ as we have seen above. Thus, for infinitely many $n$ we have $k = \min I$ and the output in Step 4 is $i_n = k$, almost surely. Therefore $k \in Q_\infty$ almost surely. For $1 \le i < k$ almost surely $i \notin Q_\infty$ by the above argument. Hence, almost surely $k = \min Q_\infty$ and $k = \liminf_{n \to \infty} i_n$. This shows item (i) and, together with Definition 7, item (iii).

For every $i > k$ such that there are infinitely many $n$ for which $\gamma_{i,n} > \gamma_{i,n-1}, \ldots, \gamma_{i,1}$ in Step 3 we have $\lim_{n \to \infty} 1/\gamma_{i,n} = 0$. This again means that $q_i$ is almost surely the probability mass function according to which $x_1, x_2, \ldots$ is i.i.d. drawn from $L$. Hence $i \in Q(p)$ almost surely. Clearly, $i$ is put in $I$ for infinitely many $n$. If there are also infinitely many $n$ such that $i = \min I$, then $i \in Q_\infty$. Thus $Q_\infty \subseteq Q(p)$ almost surely. This shows item (ii). ■

■ 

Proof of Theorem 2: Up to, and exclusive of, Definition 7 we follow the proof of Theorem 1.

Algorithm ($x_1, x_2, \ldots$):
Step 1 for $n = 1, 2, \ldots$ execute Steps 2 and 3.
Step 2 Set $I := \emptyset$; for $i := 1, 2, \ldots, n$:
       if $\sum_{j=1}^{n} |q_i^n(a_j) - \#a_j(x_1, x_2, \ldots, x_n)/n| = 0$ then $I := I \cup \{i\}$.
Step 3 if $I \neq \emptyset$ then $i_n := \min I$ else $i_n := i_{n-1}$; output $i_n$.

Fig. 2. Algorithm 1b

Algorithm ($x_1, x_2, \ldots$):
Step 1 for $n := 1, 2, \ldots$ execute Step 2.
Step 2 for every $a \in L$ occurring in $x_1, x_2, \ldots, x_n$ set $p_n(a) := \#a(x_1, x_2, \ldots, x_n)/n$.

Fig. 3. Algorithm 2



Since $\lim_{n \to \infty} q_i^n(a_j) = q_i(a_j)$ for every $i, j \ge 1$, by (A.2) almost surely

\[ \lim_{n \to \infty} \sum_{a_j \in L} \left| q_{k'}^n(a_j) - \frac{\#a_j(x_1, x_2, \ldots, x_n)}{n} \right| = 0 \]

for a $k'$ satisfying $k' \le k$. Almost surely, the above displayed equation does not hold for $i$ ($1 \le i < k'$) by similar reasoning as in the proof of Theorem 1. Hence there is an $N$ such that for all $n \ge N$ we have $k' \in I$ and $i \notin I$ for every $i$ ($1 \le i < k'$) in Step 2 of Algorithm 1b. Consequently, in Step 3 of the algorithm $i_n = k'$ for all $n \ge N$. ■
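A minimal Python sketch of Algorithm 1b follows. It rests on assumptions that are not in the text: the candidate list and the alphabet are finite, each candidate is an exact pmf, and the test "equal to 0" is modeled by a tolerance `tol` standing in for the round-off error. The function name `algorithm_1b` and the toy data are hypothetical.

```python
from collections import Counter

def algorithm_1b(data, candidates, alphabet, tol=0.05):
    """Sketch of Algorithm 1b; tol plays the role of the round-off error (assumed)."""
    counts = Counter()
    outputs, previous = [], None
    for n, x in enumerate(data, start=1):
        counts[x] += 1
        selected = set()                 # the set I of Step 2
        for i in range(min(n, len(candidates))):
            q = candidates[i]
            d = sum(abs(q.get(a, 0.0) - counts[a] / n) for a in alphabet)
            if d <= tol:                 # "equal to 0" up to round-off
                selected.add(i + 1)
        if selected:
            previous = min(selected)     # Step 3
        outputs.append(previous)         # None until some candidate has been selected
    return outputs

data = ["a", "b"] * 25                   # deterministic toy data
candidates = [{"a": 0.9, "b": 0.1}, {"a": 0.5, "b": 0.5}]
out_1b = algorithm_1b(data, candidates, ["a", "b"])
print(out_1b[:4])
```

In contrast to the sketch of a record-based test, the output here stops changing once the empirical frequencies stay within the tolerance of the first matching candidate.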
Proof of Theorem 3:

Algorithm 2 (Figure 3) together with the Strong Law of Large Numbers shows that $\lim_{n \to \infty} p_n(a) = p(a)$ almost surely for every $a \in L$. Here $p$ is the probability mass function of $L$ based on the data sequence $x_1, x_2, \ldots$. Note that in Algorithm 2 the different values of $p_n$ sum to precisely 1 for every $n = 1, 2, \ldots$. ■
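Step 2 of Algorithm 2 is a plain relative-frequency count. The sketch below uses exact rational arithmetic so that the remark about the values summing to precisely 1 can be checked directly; the function name `empirical_pmf` and the sample data are hypothetical.

```python
from collections import Counter
from fractions import Fraction

def empirical_pmf(xs):
    """p_n(a) = #a(x_1, ..., x_n) / n for every a occurring in xs."""
    n = len(xs)
    return {a: Fraction(c, n) for a, c in Counter(xs).items()}

p_n = empirical_pmf([3, 1, 3, 3, 2, 1])
print(p_n)
print(sum(p_n.values()))    # the values sum to exactly 1
```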
Proof of Theorem 4: We need the theory of Kolmogorov complexity [14] (originally in [11] and the prefix version we use here in [13]). A prefix Turing machine is one with a one-way read-only input
tape with a distinguished tape cell called the origin, a working tape that is a two-way read-write tape on which the computation takes place, and a write-only output tape. At the start of the computation the input tape is infinitely inscribed from the origin onwards, and the input head is on the origin. The machine operates with binary input. If the machine halts then the input head has scanned a segment of the input tape from the origin onwards. We call this initial segment the program.

By this definition the set of programs is a prefix code: no program is a proper prefix of any other program. Consider a standard enumeration of all prefix Turing machines $T_1, T_2, \ldots$. Let $U$ denote a universal Turing machine such that for every $z \in \{0, 1\}^*$ and $i \ge 1$ we have $U(i, z) = T_i(z)$. That is, for all finite binary strings $z$ and every machine index $i \ge 1$, we have that $U$'s execution on inputs $i$ and $z$ results in the same output as that obtained by executing $T_i$ on input $z$. There are infinitely many such $U$'s. Fix one such $U$ (and with some abuse of notation denote it as $U$ henceforth) and define the conditional prefix complexity $K(x|y)$ for all $x, y \in \{0, 1\}^*$ by $K(x|y) = \min_p \{|p| : p \in \{0, 1\}^* \text{ and } U(p, y) = x\}$. For the same $U$, define the time-bounded conditional prefix complexity $K^t(x|y) = \min_p \{|p| : p \in \{0, 1\}^* \text{ and } U(p, y) = x \text{ in } t \text{ steps}\}$. To obtain the unconditional versions of the prefix complexities set $y = \lambda$ where $\lambda$ is the empty word (the word with no letters).
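The prefix-code property (no program is a proper prefix of another) is easy to test on a finite set of bit strings. The sketch below is an illustration only; the function name `is_prefix_free` and the sample sets are hypothetical.

```python
def is_prefix_free(programs):
    """True iff no string in programs is a proper prefix of another one."""
    ordered = sorted(set(programs))
    # After lexicographic sorting, a string that is a prefix of some other
    # string is also a prefix of its immediate successor, so it suffices
    # to compare neighbours.
    return all(not b.startswith(a) for a, b in zip(ordered, ordered[1:]))

print(is_prefix_free(["0", "10", "110", "111"]))   # True
print(is_prefix_free(["0", "01", "11"]))           # False: "0" is a prefix of "01"
```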

By definition the sets over which the minimum is taken are countable and not empty. It can be shown that $K(x|y)$ is incomputable. Clearly $K^t(x|y)$ is computable if $t$ is computable. Moreover, $K^{t'}(x|y) \le K^t(x|y)$ for every $t' \ge t$, and $\lim_{t \to \infty} K^t(x|y) = K(x|y)$. Since everything is discrete, there is a least time $t_{x|y} < \infty$ such that $K^{t_{x|y}}(x|y) = K(x|y)$, even though the function $f(x, y)$ defined by $f(x, y) = t_{x|y}$ for all $x, y \in \{0, 1\}^*$ may be incomputable.
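The monotone behavior of the time-bounded complexity can be illustrated with a toy stand-in for the universal machine. In the sketch below the "machine" is a hypothetical finite table from programs to (output, halting time); it is not a universal machine, and the names `k_time_bounded` and `toy_machine` are assumptions of this example.

```python
def k_time_bounded(x, t, toy_machine):
    """Length of a shortest program that outputs x within t steps, or None.

    toy_machine maps a program (bit string) to (output, halting_time);
    it is a hypothetical finite stand-in for the universal machine U.
    """
    lengths = [len(p) for p, (out, steps) in toy_machine.items()
               if out == x and steps <= t]
    return min(lengths) if lengths else None

toy_machine = {             # hypothetical programs forming a prefix-free set
    "0": ("ab", 50),        # short but slow
    "10": ("ab", 7),
    "110": ("ab", 2),       # long but fast
    "111": ("ba", 3),
}
values = [k_time_bounded("ab", t, toy_machine) for t in (1, 2, 7, 50, 1000)]
print(values)               # [None, 3, 2, 1, 1]
```

The value never increases with $t$ and becomes constant from the least time at which a shortest program has halted (here 50 steps), which mirrors the role of $t_{x|y}$ in the text.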

Definition 8: The language $L = \{a_1, a_2, \ldots, a_m\}$ is finite. We view $a \in L$ as an integer and $a < \infty$. If $x_1, x_2, \ldots, x_n$ is a data sequence with $x_i \in L$ ($1 \le i \le n$), then $K(x_1 \ldots x_n | y) = \min_p \{|p| : p \in \{0, 1\}^* \text{ and } U(p, y) = x_{1:n}\}$ where $x_{1:n}$ is an agreed-upon binary encoding of $x_1 x_2 \ldots x_n$. Similarly we define $K^t(x_1 \ldots x_n | y)$.

We now turn to the theory of semicomputable semimeasures. In particular we exhibit a formal criterion that an infinite sequence is "typical" or "random" in Martin-Löf's sense [13].

Definition 9: Let $\mu$ be a computable measure on the set of infinite sequences of elements from $L$. A particular infinite sequence $x = x_1, x_2, \ldots \in L^\infty$ is typical or random for $\mu$ if

\[ \sup_n \left\{ \log \frac{1}{\mu(x_1 \ldots x_n)} - K(x_1 \ldots x_n | \mu) \right\} < \infty. \]

In [14] the definition is different, but is equivalent to the above one by Corollary 4.5.2 there. The measure $\mu$ in the conditional of $K(\cdot|\cdot)$ means a finite number of bits that constitute a program that describes $\mu$. Clearly, $K(\mu) < \infty$ since $\mu$ is computable. Moreover, according to [14] we have $K(x|y) \ge K(x) - K(y) + O(1)$ for every finite $x$ and $y$. Therefore, $K(x_1 \ldots x_n) - K(\mu) + O(1) \le K(x_1 \ldots x_n | \mu) + O(1) \le K(x_1 \ldots x_n) + O(1)$. Hence we can replace the last displayed formula by

\[ \sup_n \left\{ \log \frac{1}{\mu(x_1 \ldots x_n)} - K(x_1 \ldots x_n) \right\} < \infty. \tag{A.4} \]

Our data is, by assumption, typical (equivalently random) for some computable measure $\mu$. That is, the data $x_1, x_2, \ldots$ satisfies (A.4) with respect to $\mu$. We can effectively enumerate all and only semimeasures that are semicomputable from below as the elements listed in $\mathcal{M}$ of (IV.2).

Remark 6: We stress that the data is possibly $\mu$-random and $\mu'$-random for different measures $\mu$ and $\mu'$. In general it can be so for many measures in $\mathcal{M}$. Therefore we cannot speak of the true measure, but only of a measure for which the data is typical. ◊

To eliminate the undesirable lower semicomputable semimeasures among $\mu_1, \mu_2, \ldots$ we sieve out the ones that are not measures, and among the measures the ones for which $x_1, x_2, \ldots$ is not random.

To do so, we conduct for elements of $\mathcal{M}$ a test for both properties. Since the test is computational we need the (in)equality relations $\overset{+}{<}, \overset{+}{>}, \overset{+}{=}, \overset{+}{\le}, \overset{+}{\ge}$ of Definition 4. In particular this is needed in the properties in Claim 3. These properties are used in Step 3 of Algorithm 3 (Figure 4).

Definition 10: Since the algorithms have a round-off error, we can test only $\overset{+}{=} 0$ or not $\overset{+}{=} 0$. Consequently, we count semimeasures as measures if they satisfy (IV.1) but deviate from equalities by a very small additive term only. More precisely, $\mu_i$ in the list $\mathcal{M}$ is counted as a semimeasure but not as a measure if $\mu_i(z) - \sum_{a \in L} \mu_i(za) \overset{+}{>} 0$. If $\mu_i(z) - \sum_{a \in L} \mu_i(za) \overset{+}{=} 0$ then we view $\mu_i$ as a measure.
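The test behind this definition compares the mass of a string with the total mass of its one-symbol extensions. The Python sketch below illustrates it on two hypothetical examples over the alphabet {a, b}: an i.i.d. measure and a semimeasure that loses half of its mass at every extension. The tolerance `TOL` stands in for the round-off error and is an assumed value; all names are illustrative.

```python
def defect(mu, z, alphabet):
    """mu(z) - sum_a mu(z + a): zero for a measure, positive for a strict semimeasure."""
    return mu(z) - sum(mu(z + a) for a in alphabet)

def bernoulli(z, p=0.25):
    """i.i.d. measure on strings over {a, b} with P(a) = p (hypothetical example)."""
    return p ** z.count("a") * (1 - p) ** z.count("b")

def leaky(z):
    """Semimeasure that loses half of its mass at every extension (hypothetical)."""
    return bernoulli(z) * 0.5 ** len(z)

TOL = 1e-12    # stands in for the round-off error; an assumed value

for name, mu in (("bernoulli", bernoulli), ("leaky", leaky)):
    d = defect(mu, "ab", "ab")
    print(name, d, "measure-like at z" if abs(d) <= TOL else "semimeasure but not a measure")
```

A single string $z$ with positive defect suffices to rule out a measure, whereas accepting a function as a measure requires the defect to vanish (up to round-off) at every $z$, which is why the algorithm can only decide this in the limit.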

Claim 2: Assume that $\mu_i$ in $\mathcal{M}$ is a semimeasure but not a measure according to Definition 10. Then, there is a least $z \in L^*$ and a least $n_i$ such that $\mu_i^n(z) - \sum_{a \in L} \mu_i^n(za) \overset{+}{>} 0$ for every $n \ge n_i$.

Proof: Since $\mu_i$ is lower semicomputable and $|L| < \infty$, for every $z \in L^*$ there is an $n_z$ such that for every $n \ge n_z$ we have

\[ 0 \le \mu_i(z) - \mu_i^n(z) \overset{+}{=} 0 \]

and for every $a \in L$ we have

\[ 0 \le \sum_{a \in L} \mu_i(za) - \sum_{a \in L} \mu_i^n(za) \overset{+}{=} 0. \]

Therefore, 

■ 

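Definition 11: Let $\mu_i$ be an element of $\mathcal{M}$. Assume that for all $n$ we have $i, j > 0$ and $i + j = n$. Define $Z_{i,j,n} = \{z \in L^* : |z| \le j,\ \mu_i^n(z) - \sum_{a \in L} \mu_i^n(za) \overset{+}{>} 0\}$ and $\Delta_{i,n} = \max\{\Delta : Z_{i,j-\Delta,n-\Delta} \cap \cdots \cap Z_{i,j,n} \neq \emptyset\}$.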

Remark 7: The set $Z_{i,j,n}$ contains all strings $z \in L^*$ of length at most $j$ such that $i + j = n$ and $\mu_i^n(z) - \sum_{a \in L} \mu_i^n(za) \overset{+}{>} 0$. The intersection $Z_{i,j-\Delta,n-\Delta} \cap \cdots \cap Z_{i,j,n}$ is the set of all strings $z$ of length at most $j - \Delta$ that witness that the approximations $\mu_i^{n-\Delta}, \ldots, \mu_i^n$ are all semimeasures but not measures. The quantity $\Delta_{i,n}$ is the maximum number of approximations before and including the $n$th approximation of $\mu_i$ such that the same $z \in L^*$ of length at most $j - \Delta_{i,n}$ with $i + j = n$ witnesses that all these approximations are semimeasures but not measures: $\mu_i^{n'}(z) \overset{+}{>} \sum_{a \in L} \mu_i^{n'}(za)$ for every $n'$ such that $n - \Delta_{i,n} \le n' \le n$. ◊

Claim 3: Let $\mu_i$ be an element of the list $\mathcal{M}$ and $n_i$ be as in Claim 2. If $\mu_i$ is not a measure according to Definition 10, then we have $\Delta_{i,n} \ge n - n_i$ for all $n \ge n_i$ and $\Delta_{i,n} = \Delta_{i,n-1} + 1$ for all $n > n_i$. If $\mu_i$ is a measure according to Definition 10, then for every $n$ there is a greatest $c < \infty$ such that $\Delta_{i,n} < n - c$ and $c$ goes to $\infty$ with growing $n$. We have $\Delta_{i,n} < \max\{\Delta_{i,n-1}, \ldots, \Delta_{i,1}\} + 1$ for infinitely many $n$ iff $\mu_i$ is a measure.

Proof: If $\mu_i$ is not a measure then by Claim 2 there is a $z \in L^*$ such that $\mu_i^n(z) - \sum_{a \in L} \mu_i^n(za) \overset{+}{>} 0$ for all $n \ge n_i$. This $z$ will never leave the sets $Z_{i,j,n}$ ($|z| \le j = n - i$) for $n \ge n_i$. Therefore, $\Delta_{i,n} \ge n - n_i$ and $\Delta_{i,n} = \Delta_{i,n-1} + 1$ for all $n > n_i$.

If $\mu_i$ is a measure then $\lim_{n \to \infty} (\mu_i^n(z) - \sum_{a \in L} \mu_i^n(za)) = 0$. Therefore, for every $z$ there is a least $n_{i,z} < \infty$ such that $\mu_i^n(z) - \sum_{a \in L} \mu_i^n(za) \overset{+}{=} 0$ for all $n \ge n_{i,z}$. Hence, for every $j$ and every $z \in L^*$ with $|z| \le j$ we have $z \notin Z_{i,j,n}$ for all $n \ge n_{i,z}$. Thus, every finite string in $Z_{i,j,n}$ is not a member of $Z_{i,j+n'-n,n'}$ any more for every large enough $n' > n$. Therefore, for every $n$ there is a greatest $c < \infty$ such that $\Delta_{i,n} < n - c$ and $c$ goes to $\infty$ with growing $n$. This implies that

\[ \Delta_{i,n} < \max\{\Delta_{i,n-1}, \ldots, \Delta_{i,1}\} + 1 \]

for infinitely many $n$. In view of the above property of semimeasures that are not measures according to Definition 10, $\mu_i$ is a measure iff the last displayed equation holds. ■
Remark 8: To make everything effective (computable) for Algorithm 3 (Figure 4) we do not use prefix complexity as in (A.4) but the time-bounded analog as defined. By using dovetailing, that is, $n = 1, 2, \ldots$ with all combinations of $i, j > 0$ such that $i + j = n$ and $n$ is the number of steps, as in Definition 11, with growing $n$ every $\mu_i^n(z)$ with $|z| \le j$ for every particular $i, j, n$ is computed and considered.

In Algorithm 3 one wants to determine the indexes $i$ of elements $\mu_i$ in the list $\mathcal{M}$ such that $\mu_i$ is a measure. This happens in Steps 2 and 3 as follows. For growing $n$ the fact that index $i$ is selected to go in $I$ means that $\mu_i$ is possibly a measure. Eventually, if $n$ is large enough this possibility will turn into a certainty. Moreover, this will hold for every measure $\mu_i$. Thus, for every $i$ with growing $n$, but not computably, it is decided whether $\mu_i$ is a measure or not. If it is a measure, then it keeps on figuring in the second part of the algorithm; if it is decided not to be a measure then it will not figure in the second part of the algorithm.

This second part of the algorithm determines for which measures the data $x_1, x_2, \ldots$ is random. Note that this part initially also may consider nonmeasures. But with growing $n$, because of the first part of the algorithm, for every index $i$ it will consider only measures but not nonmeasures $\mu_i$. If for some $n$ the index $i$ is selected then in Step 5 the algorithm computes the $n$-approximations $\rho(i, j', n) := \log 1/\mu_i^n(x_1 \ldots x_{j'}) - K^n(x_1 \ldots x_{j'})$ ($1 \le j' \le j$, $j = n - i$). This is done in order to obtain approximations to the elements constituting the initial segment of the sequence of which equation (A.4) takes the supremum. In Step 6 the algorithm takes the maximum $\sigma(i, n, j') := \max\{\rho(i, j'', n) : 1 \le j'' \le j'\}$ over the initial segments of this initial segment. In Step 7 it determines for every $\mu_i$ concerned how long the longest flat plateau of this sequence of maxima is.

Suppose $\mu_i$ is a measure for which $x_1, x_2, \ldots$ is random. If there exists an $n_0$ such that the $n_0$-approximation $\rho(i, j_0, n_0)$ has reached the supremum in (A.4), then for every $j \ge j_0$ and $n \ge n_0$ we have that $\rho(i, j, n)$ reaches this supremum. (Reaching means "is within a unit.") To select a $\mu_i$ one looks for the $\rho(i, \cdot, \cdot)$ that gives the longest flat plateau, that is, has reached the supremum in (A.4) the soonest in terms of the initial segment of $x_1, x_2, \ldots$. Thus, in Step 7 the algorithm compares the length of the flat plateau with the top score, and changes the latter if it is exceeded. In Step 8 the algorithm either selects the index resulting in a new or equal top score or goes with the index of approximation $n - 1$. Note that with growing $n$ nonmeasures are excluded in the first part of the algorithm. Thus, with growing $n$, the measure $\mu_i$ that reaches (A.4) soonest and has the longest flat plateau has an index $i$ that is not (eventually) excluded by the first part of the algorithm. ◊

Definition 12: Let $\mu_i$ in list $\mathcal{M}$ be a measure according to Definition 10 and $x_1, x_2, \ldots$ be an infinite sequence of elements from $L$. By (A.4) we have $\mu_i \in \mathcal{M}(x)$ iff there is a $\sigma_i < \infty$ such that




(A.6) 



for $1 \le j < \infty$. Define $m_i$ as the least $j$ for which $\sigma_i$ is reached.

Algorithm ($x_1, x_2, \ldots$):
Step 1 $m := 0$; for $n = 1, 2, \ldots$ execute Steps 2 through 8.
Step 2 $I := \emptyset$; for every $i, j > 0$ satisfying $i + j = n$, compute $\Delta_{i,n}$ (note $j = n - i$) and execute
       Steps 3 through 7.
Step 3 if $\Delta_{i,n} < \max\{\Delta_{i,n-1}, \ldots, \Delta_{i,1}\} + 1$ then $I := I \cup \{i\}$ (by Claim 3, $i \in I$ for infinitely
       many $n$ iff $\mu_i$ is a measure).
Step 4 if $I \neq \emptyset$ then execute Steps 5 through 7 for every $i \in I$.
Step 5 for $j' := 1, \ldots, j$ set $\rho(i, j', n) := \log 1/\mu_i^n(x_1 \ldots x_{j'}) - K^n(x_1 \ldots x_{j'})$.
Step 6 for $j' := 1, \ldots, j$ set $\sigma(i, n, j') := \max\{\rho(i, j'', n) : 1 \le j'' \le j'\}$.
Step 7 $s(i, n) := \max\{s : \lfloor \sigma(i, n, r) \rfloor - 1 = \cdots = \lfloor \sigma(i, n, r + s) \rfloor - 1,\ 1 \le r \le r + s \le j\}$; if
       $s(i, n) < m$ then $I := I \setminus \{i\}$ else $m := s(i, n)$.
Step 8 if $I \neq \emptyset$ then $i_n := \min\{i : s(i, n) = m\}$ else $i_n := i_{n-1}$; output $i_n$.

Fig. 4. Algorithm 3
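Steps 6 and 7 of Algorithm 3, for one fixed index and one fixed $n$, reduce to a running maximum followed by a search for the longest run of equal rounded values. The Python sketch below shows only that part; the input list stands for already computed approximations from Step 5 and is hypothetical, as is the function name `longest_plateau`.

```python
import math

def longest_plateau(rho):
    """Steps 6 and 7 for one index: running maxima of rho, then the largest s
    such that floor(sigma) - 1 is constant on s + 1 consecutive positions."""
    sigma, best = [], -math.inf
    for value in rho:
        best = max(best, value)          # Step 6: running maximum
        sigma.append(best)
    levels = [math.floor(v) - 1 for v in sigma]
    longest = run = 0
    for prev, cur in zip(levels, levels[1:]):
        run = run + 1 if cur == prev else 0
        longest = max(longest, run)      # Step 7: longest flat stretch so far
    return longest

print(longest_plateau([0.5, 1.2, 1.1, 1.9, 3.0]))   # 2
```

In the example the running maxima are 0.5, 1.2, 1.2, 1.9, 3.0, whose rounded-down values minus 1 are -1, 0, 0, 0, 2, so the longest plateau spans three positions and the score is 2.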

Remark 9: In Definition 12 we have replaced "measure" by "measure according to Definition 10." We have replaced the "sup" in (A.4) by "max" by rounding down. Moreover, by rounding down and subtracting 1 we have taken care that $m_i < \infty$. ◊

(Proof of the theorem continued.) The sequence of outputs of Algorithm 3 consists of indexes $i_1, i_2, \ldots$ of lower semicomputable semimeasures in $\mathcal{M}$, such that possibly $i_j < i_{j+1}$, $i_j > i_{j+1}$, or $i_j = i_{j+1}$. By Claim 3, for every measure $\mu_i$ (Definition 10) in $\mathcal{M}$ we have $i \in I$ in Step 3 of the algorithm for infinitely many $n$. For large enough $n$ we have $i \notin I$ for the index $i$ of a nonmeasure $\mu_i$. So for every index $i$ there is a large enough $n$ such that $i \in I$ only if $\mu_i$ is a measure. These measures are treated in Steps 4 through 8. By assumption the data $x = x_1, x_2, \ldots$ is random (typical) for some measure in $\mathcal{M}$. Let this be measure $\mu_k$.

Let us look at long plateaus $s(i, n)$ for measures $\mu_i$ such that either $x = x_1, x_2, \ldots$ is not random to it or $m_i > n$ (with $m_i$ as in Definition 12). For a measure $\mu_i$ such that $x = x_1, x_2, \ldots$ is not random to it, $s(i, n)$ can be any constant $c$. However, the left-hand side of (A.4) goes to infinity in this case, so we know that $s(i, n') = 1$ for some $n' > n$. Since the data $x = x_1, x_2, \ldots$ is random to $\mu_k$, for all $n$ that are large enough $s(k, n) = s(k, n - 1) + 1$ by Claim 3. Hence, for large enough $n$ we have $s(k, n) > s(i, n)$, since $s(k, n) \to \infty$ with $n \to \infty$. For a measure $\mu_j$ with $j \neq k$ such that $x = x_1, x_2, \ldots$ is random to it, $s(j, n)$ can be a constant $c < \sigma_j$ while $n < m_j$. That is, the maximum in (A.6) has not yet been reached. Again we know that $s(j, n') = 1$ for some $n' > n$. Hence without loss of generality we can exclude cases like measures $\mu_i$ and $\mu_j$. Let us consider only measures $\mu_i$ and steps $n$, such that $x = x_1, x_2, \ldots$ is random to $\mu_i$ and $n \ge m_i$.
Without loss of generality we can assume that we compute $s(i, n)$ for every measure $\mu_i$ in $\mathcal{M}$ and every $n$ according to Steps 6 and 7, and not just when $i \in I$. In particular we can do so for $\mu_k$. By Claim 3, for every $n$ that is large enough we have $s(k, n) < s(i, k - 1) + 1$ and $s(k, n) = s(k, n - 1) + 1$. Let $i$ satisfy $m_k < i$. Then, $s(i, m_k) = 0$ (the set over which $s(i, m_k)$ is maximized in Step 7 equals $\emptyset$). Since $s(i, n) \le s(i, n - 1) + 1$ for all $n$, we have $s(i, n) < s(k, n) \le m$ (with $m$ as in Step 7) for all $n > m_k$. Hence $i_n \neq i$ in Step 8 if $n$ is large enough. Let $A$ be the set of measures $\mu_i$ with $i \le m_k$. Then $|A| \le m_k$ and $\mathcal{M}_\infty \subseteq A$. Hence $|\mathcal{M}_\infty| \le m_k < \infty$.

Since $|\mathcal{M}_\infty| \le m_k$, and by Steps 5, 6, 7 the output of Algorithm 3 is an infinite sequence $i_1, i_2, \ldots$, we have that some $i_0 \le m_k$ occurs infinitely often in this sequence. Hence $\mathcal{M}_\infty \neq \emptyset$.

Let the index $i$ of a measure $\mu_i$ occur in $\mathcal{M}_\infty$. Then $s(i, n)$ is larger than $m$ in Step 7 for infinitely many $n$. (It is impossible that $i_n = i_{n-1}$ for an infinitely long run of $n$'s since $m \to \infty$ with $n \to \infty$. The latter statement is a consequence of $s(k, n)$ growing with $n$.) Since $m \to \infty$ with $n \to \infty$ we have $s(i, n) \to \infty$ with $n \to \infty$. By the definition of $s(\cdot, \cdot)$ and (A.6), the data $x = x_1, x_2, \ldots$ is random with respect to the measure $\mu_i$. Hence, $\mathcal{M}_\infty \subseteq \mathcal{M}(x)$. This proves the theorem. ■